Is Taylor Swift a Harbinger of Doom (R,SQL,Python)

Project Members: Brandon Malady, Anusha Ravi

Project Outline

Abstract: This project aims to explore the potential connection between Taylor Swift’s rise to stardom and disaster in the United States. As Swift continues to dominate the airwaves, one must ask: could her soaring success be a smokescreen for something far more sinister?

Data Sources: FBI — Crime in the U.S.; Centers for Disease Control and Prevention (National Center for Health Statistics); Google Trends data for “Taylor Swift”

Aim 1: Explore the relationship between Taylor Swift’s popularity and online presence and violent crime rates over time in the United States.

Using data collected by the FBI, we will explore the potential connection between Swift’s stardom and violent crime rates in the United States. Our FBI dataset is broken down by year, which provides excellent temporal resolution. We will examine aggregate Taylor Swift interest over time against violent crime rates to see if there is a connection.

Aim 2: Investigate deaths from drug overdoses in the United States and their correlation with Taylor Swift’s popularity over time.

Using data collected by the CDC, we will explore the relationship between Taylor Swift’s popularity and drug overdose deaths in the United States. The data are broken down by year, which ensures we can examine temporal relationships between Taylor’s music and drug overdoses in the American population.

Data Sources & Screenshots

Taylor Swift Google Trends Data

Google Trends data for the term “Taylor Swift”, all time. Example of raw data from Google Trends for the term “Taylor Swift”, all time.

Taylor Swift Popularity Data: Source: Google Trends. Numbers represent search interest relative to the highest point on the chart for the United States from 2004-2024. A value of 100 is the peak popularity for the term. A value of 50 means the term is half as popular. A score of 0 means there was not enough data for the term.

FBI Crime Data

FBI Data Source Page FBI Raw Data

Violent Crime Rate Data: Source: FBI. The violent crime rate represents total reported violent crimes (murder, rape, aggravated assault, robbery) per 100,000 members of the population; e.g., a crime rate of 500 means 500 violent crimes reported per 100,000 population.
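As a quick arithmetic illustration of the rate definition (hypothetical counts, not values from the FBI table):

```python
def violent_crime_rate(total_crimes: int, population: int) -> float:
    """Reported violent crimes per 100,000 members of the population."""
    return total_crimes / population * 100_000

# Hypothetical example: 1,500,000 reported violent crimes in a population of 300 million
print(violent_crime_rate(1_500_000, 300_000_000))  # 500.0
```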

CDC Drug Data

CDC Raw Overdose Data

Drug Overdose Data: Source: Centers for Disease Control and Prevention, National Center for Health Statistics

Wrangled Data

Wrangled & Combined data of interest: 2004-2013 Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity. Sources: FBI, CDC, Google Trends “Taylor Swift”.

library(dplyr);

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse);
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.5.1     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Taylor Swift Google Trends Data & Wrangling

#load in the Google Trends data on "Taylor Swift"
library(readr);
Taylor_Interest_Over_Time <- read_csv("Taylor_Interest_Over_Time.csv", 
    skip = 1)
Rows: 250 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Month, taylor swift: (United States)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(Taylor_Interest_Over_Time)
# A tibble: 250 × 2
   Month   `taylor swift: (United States)`
   <chr>   <chr>                          
 1 2004-01 0                              
 2 2004-02 0                              
 3 2004-03 0                              
 4 2004-04 0                              
 5 2004-05 0                              
 6 2004-06 0                              
 7 2004-07 0                              
 8 2004-08 0                              
 9 2004-09 0                              
10 2004-10 0                              
# ℹ 240 more rows
# skip = 1 above drops the data label row, which is not data but part of the raw CSV format
# Wrangle the data to be more usable
result <- Taylor_Interest_Over_Time %>% # Separate wider to get "YYYY" column
  # Split the "Month" column into year and month. Current format is "YYYY-MM"
  separate("Month", into = c("year", "Month_new"), sep = "-", convert = TRUE) %>%
  # Rename the popularity metric column
  rename(Popularity = `taylor swift: (United States)`) %>%
  # Convert "<1" to 0.5 in the Popularity column to avoid calculation errors. Approximation
  mutate(Popularity = if_else(Popularity == "<1", 0.5, as.numeric(Popularity)))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Popularity = if_else(Popularity == "<1", 0.5,
  as.numeric(Popularity))`.
Caused by warning in `if_else()`:
! NAs introduced by coercion
# Aggregate Taylor Swift Data monthly data to yearly sums
Taylor_swift_popularity_yearly_sums <- result %>%
  group_by(year) %>%
  summarise(total_count = sum(Popularity)) %>%
  arrange(year);
# Print the result
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
    year total_count
   <int>       <dbl>
 1  2004         0  
 2  2005         0.5
 3  2006        15  
 4  2007        52  
 5  2008       122  
 6  2009       265  
 7  2010       220  
 8  2011       194  
 9  2012       235  
10  2013       220  
# ℹ 11 more rows
# Generate a "synonym" by converting "2004"-type values for year to "Two Thousand and Four"
[1] "Two Thousand and Four"
# ChatGPT was used with the prompt "I want to convert the year column from displaying "2001" to "two thousand and one" for each year" to generate the following function

number_to_words <- function(num) {
  ones <- c("", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
  tens <- c("", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
  teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
  
  thousands <- num %/% 1000
  remainder <- num %% 1000
  
  result <- if (thousands > 0) paste(ones[thousands + 1], "thousand") else ""
  
  if (remainder >= 100) {
    hundreds <- remainder %/% 100
    result <- paste(result, ones[hundreds + 1], "hundred")
    remainder <- remainder %% 100
  }
  
  if (remainder > 0) {
    if (nchar(result) > 0) result <- paste(result, "and")
    if (remainder < 20) {
      result <- paste(result, if (remainder < 10) ones[remainder + 1] else teens[remainder - 9])
    } else {
      result <- paste(result, tens[remainder %/% 10 + 1], ones[remainder %% 10 + 1])
    }
  }
  
  return(trimws(result))
}

Taylor_swift_popularity_yearly_sums <- Taylor_swift_popularity_yearly_sums %>%
  mutate(year = sapply(year, function(x) tools::toTitleCase(number_to_words(as.numeric(x)))))
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
   year                      total_count
   <chr>                           <dbl>
 1 Two Thousand and Four             0  
 2 Two Thousand and Five             0.5
 3 Two Thousand and Six             15  
 4 Two Thousand and Seven           52  
 5 Two Thousand and Eight          122  
 6 Two Thousand and Nine           265  
 7 Two Thousand and Ten            220  
 8 Two Thousand and Eleven         194  
 9 Two Thousand and Twelve         235  
10 Two Thousand and Thirteen       220  
# ℹ 11 more rows

The above table will be our pseudo-synonym data source. “Two Thousand and Four” is a synonym for “2004”, which we will wrangle back before a subsequent join operation into a master data table.
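The ChatGPT-generated R helper above can be cross-checked with a Python port of the same digits-to-words logic (a sketch, assuming four-digit years):

```python
def year_to_words(num: int) -> str:
    """Spell out a year the way the R helper does, e.g. 2004 -> 'two thousand and four'."""
    ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen"]

    thousands, remainder = divmod(num, 1000)
    parts = [f"{ones[thousands]} thousand"] if thousands else []

    if remainder >= 100:
        hundreds, remainder = divmod(remainder, 100)
        parts.append(f"{ones[hundreds]} hundred")

    if remainder > 0:
        if parts:
            parts.append("and")
        if remainder < 10:
            parts.append(ones[remainder])
        elif remainder < 20:
            parts.append(teens[remainder - 10])
        else:
            parts.append(tens[remainder // 10]
                         + (" " + ones[remainder % 10] if remainder % 10 else ""))

    return " ".join(parts)

print(year_to_words(2004))  # two thousand and four
```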

# Create a function to map the language-format year values back to numeric
year_to_numeric <- function(year_text) {
  year_mapping <- c(
    "Two Thousand and Four" = 2004, "Two Thousand and Five" = 2005,
    "Two Thousand and Six" = 2006, "Two Thousand and Seven" = 2007,
    "Two Thousand and Eight" = 2008, "Two Thousand and Nine" = 2009,
    "Two Thousand and Ten" = 2010, "Two Thousand and Eleven" = 2011,
    "Two Thousand and Twelve" = 2012, "Two Thousand and Thirteen" = 2013
  )
  return(year_mapping[year_text])
}

# Apply the language to numeric "YYYY" conversion 
Taylor_swift_popularity_yearly_sums <-Taylor_swift_popularity_yearly_sums %>%
  mutate(year = sapply(year, year_to_numeric))

# Display the result
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
    year total_count
   <dbl>       <dbl>
 1  2004         0  
 2  2005         0.5
 3  2006        15  
 4  2007        52  
 5  2008       122  
 6  2009       265  
 7  2010       220  
 8  2011       194  
 9  2012       235  
10  2013       220  
# ℹ 11 more rows

The sentence-format years have now been handled, and all our synonyms are ready for subsequent joins.

FBI Crime Data & Wrangling

#Load FBI Violent Crime Data
library(readxl)
FBI_violent_crime_per_year <- read_excel("FBI_violent_crime_per_year.xls", 
    skip = 3)
New names:
• `` -> `...23`
• `` -> `...24`
print(FBI_violent_crime_per_year)
# A tibble: 27 × 24
   Year  Population1 `Violent\ncrime2` `Violent \ncrime \nrate`
   <chr>       <dbl>             <dbl>                    <dbl>
 1 2000    281421906           1425486                     506.
 2 20015   285317559           1439480                     504.
 3 2002    287973924           1423677                     494.
 4 2003    290788976           1383676                     476.
 5 2004    293656842           1360088                     463.
 6 2005    296507061           1390745                     469 
 7 2006    299398484           1435123                     479.
 8 2007    301621157           1422970                     472.
 9 2008    304059724           1394461                     459.
10 2009    307006550           1325896                     432.
# ℹ 17 more rows
# ℹ 20 more variables: `Murder and\nnonnegligent \nmanslaughter` <dbl>,
#   `Murder and \nnonnegligent \nmanslaughter \nrate` <dbl>,
#   `Rape\n(revised \ndefinition)3` <dbl>,
#   `Rape\n(revised \ndefinition) \nrate3` <dbl>,
#   `Rape\n(legacy \ndefinition)4` <dbl>,
#   `Rape\n(legacy \ndefinition) \nrate4` <dbl>, Robbery <dbl>, …
#get only the data of interest (Year & Crime Rate)
new_data_frame <- FBI_violent_crime_per_year |>
  select("Year", "Violent \ncrime \nrate") |>
  rename(year = Year, crime_rate = `Violent \ncrime \nrate`)
# Drop the last 7 rows as these are footnotes in the raw data table (see png)
FBI_Crime_Rate <- head(new_data_frame, -7)
# Fix the typos in the year column (typos introduced by footnotes)
FBI_Crime_Rate$year <- ifelse(FBI_Crime_Rate$year == 20186, 2018, FBI_Crime_Rate$year)
FBI_Crime_Rate$year <- ifelse(FBI_Crime_Rate$year == 20015, 2001, FBI_Crime_Rate$year)
# Convert the year column to numeric for subsequent join
FBI_Crime_Rate$year <- as.numeric(FBI_Crime_Rate$year)
# View the result 
print(FBI_Crime_Rate)
# A tibble: 20 × 2
    year crime_rate
   <dbl>      <dbl>
 1  2000       506.
 2  2001       504.
 3  2002       494.
 4  2003       476.
 5  2004       463.
 6  2005       469 
 7  2006       479.
 8  2007       472.
 9  2008       459.
10  2009       432.
11  2010       404.
12  2011       387.
13  2012       388.
14  2013       369.
15  2014       362.
16  2015       374.
17  2016       387.
18  2017       384.
19  2018       370.
20  2019       367.
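In Python, the same footnote cleanup can be sketched with plain lists (hypothetical rows mimicking the sheet's quirks; the real sheet has 7 trailing footnote rows):

```python
# Hypothetical (year, crime_rate) rows as read from the FBI sheet: footnote markers
# fused onto some years ("20015" is 2001 + footnote 5), plus trailing footnote rows
raw = [("2000", 506.5), ("20015", 504.5), ("2002", 494.4),
       ("footnote text", None), ("footnote text", None)]

# Drop the trailing non-data rows (here 2 of them; the real sheet has 7)
data = raw[:-2]

# Repair years whose trailing footnote digit got fused on, then convert to int
fixes = {"20015": "2001", "20186": "2018"}
clean = [(int(fixes.get(year, year)), rate) for year, rate in data]

print(clean)  # [(2000, 506.5), (2001, 504.5), (2002, 494.4)]
```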
# Combine the Taylor Swift and FBI data, then rename columns in the combined dataframe
combined_data <- FBI_Crime_Rate %>%
  inner_join(Taylor_swift_popularity_yearly_sums, by = "year") %>%
  rename(
    `FBI Violent Crime` = crime_rate,
    `Taylor Swift Popularity` = total_count
  )
# View the result
print(combined_data)
# A tibble: 10 × 3
    year `FBI Violent Crime` `Taylor Swift Popularity`
   <dbl>               <dbl>                     <dbl>
 1  2004                463.                       0  
 2  2005                469                        0.5
 3  2006                479.                      15  
 4  2007                472.                      52  
 5  2008                459.                     122  
 6  2009                432.                     265  
 7  2010                404.                     220  
 8  2011                387.                     194  
 9  2012                388.                     235  
10  2013                369.                     220  

CDC (National Center for Health Statistics) Drug Data & Wrangling

Drug Overdose Data: Source: Centers for Disease Control and Prevention, National Center for Health Statistics

# Drug_Use_Data
# Load and preview the CDC Drug Data
library(readxl)
Overdose_table <- read_excel("Overdose table.xlsx", 
    skip = 2) #Skip top labels as this is not the data 
New names:
• `Number of deaths` -> `Number of deaths...2`
• `Deaths per 100,000` -> `Deaths per 100,000...3`
• `` -> `...4`
• `` -> `...5`
• `Number of deaths` -> `Number of deaths...6`
• `` -> `...7`
• `Deaths per 100,000` -> `Deaths per 100,000...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `Number of deaths` -> `Number of deaths...12`
• `` -> `...13`
• `Deaths per 100,000` -> `Deaths per 100,000...14`
library(dplyr)
library(stringr)
# Remove "....." from year entries from raw formatting 
Overdose_table <- Overdose_table %>%
  mutate(Year = as.integer(str_replace_all(Year, "[^\\d]", ""))) #convert to numeric for subsequent join
library(dplyr)
# Rename the "Year" column to "year"
Overdose_table <- Overdose_table %>%
  rename(year = Year) #making cases the same for subsequent Join
# Display the first few rows 
print(Overdose_table)
# A tibble: 21 × 14
    year `Number of deaths...2` `Deaths per 100,000...3` ...4  ...5 
   <int>                  <dbl>                    <dbl> <lgl> <lgl>
 1  2001                  19394                      6.8 NA    NA   
 2  2002                  23518                      8.2 NA    NA   
 3  2003                  25785                      8.9 NA    NA   
 4  2004                  27424                      9.4 NA    NA   
 5  2005                  29813                     10.1 NA    NA   
 6  2006                  34425                     11.5 NA    NA   
 7  2007                  36010                     11.9 NA    NA   
 8  2008                  36450                     11.9 NA    NA   
 9  2009                  37004                     11.9 NA    NA   
10  2010                  38329                     12.3 NA    NA   
# ℹ 11 more rows
# ℹ 9 more variables: `Number of deaths...6` <dbl>, ...7 <lgl>,
#   `Deaths per 100,000...8` <dbl>, ...9 <lgl>, ...10 <lgl>, ...11 <lgl>,
#   `Number of deaths...12` <dbl>, ...13 <lgl>, `Deaths per 100,000...14` <dbl>
#Extract only Total deaths and Year columns 
Overdose_table_subset <- Overdose_table %>%
  select(1:2) %>%
  rename(year = 1, number_of_deaths = 2)  # Rename columns for clarity
# Display the first few rows of the new data frame
print(head(Overdose_table_subset))
# A tibble: 6 × 2
   year number_of_deaths
  <int>            <dbl>
1  2001            19394
2  2002            23518
3  2003            25785
4  2004            27424
5  2005            29813
6  2006            34425
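The dot-leader stripping done with `str_replace_all` above has a direct Python analogue using the stdlib `re` module:

```python
import re

# The CDC sheet pads years with dot leaders, e.g. "2001. . . . . . . ."
# Stripping every non-digit character recovers the numeric year
def clean_year(raw: str) -> int:
    return int(re.sub(r"\D", "", raw))

print(clean_year("2001. . . . . . . ."))  # 2001
```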

Combined Data from all 3 sources, Master Table for Analysis

library(dplyr)
# Join the previous combined data with the CDC data to create the master table
TaylorSwift_ViolentCrime_DrugOverdose <- combined_data %>%
  left_join(Overdose_table_subset, by = "year") %>%
  rename(`Drug Overdoses` = number_of_deaths)
# Display the first few rows of the joined dataframe
print(TaylorSwift_ViolentCrime_DrugOverdose)
# A tibble: 10 × 4
    year `FBI Violent Crime` `Taylor Swift Popularity` `Drug Overdoses`
   <dbl>               <dbl>                     <dbl>            <dbl>
 1  2004                463.                       0              27424
 2  2005                469                        0.5            29813
 3  2006                479.                      15              34425
 4  2007                472.                      52              36010
 5  2008                459.                     122              36450
 6  2009                432.                     265              37004
 7  2010                404.                     220              38329
 8  2011                387.                     194              41340
 9  2012                388.                     235              41502
10  2013                369.                     220              43982
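A minimal Python sketch of the same join logic, using dictionaries keyed by year (a hypothetical two-year slice of each source):

```python
# Hypothetical per-year values from the three cleaned sources (keyed by year)
popularity = {2004: 0.0, 2005: 0.5}
crime_rate = {2004: 463.2, 2005: 469.0}
overdoses = {2004: 27424, 2005: 29813}

# Inner-join on the years present in all three sources, mirroring the R joins
years = sorted(popularity.keys() & crime_rate.keys() & overdoses.keys())
master = [
    {"year": y,
     "FBI Violent Crime": crime_rate[y],
     "Taylor Swift Popularity": popularity[y],
     "Drug Overdoses": overdoses[y]}
    for y in years
]
print(master[0]["Drug Overdoses"])  # 27424
```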

Combined data of interest (2004-2013): Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity. Sources: FBI, CDC, Google Trends “Taylor Swift”.

Data Analysis & Visualization

#Scale all data to range from 0-100 for graphical visualization of relative trends per dataset
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = year, y = value, color = variable)) +
  geom_line(size = 1) +            # Line plot
  geom_point(size = 3) +           # Add points for each data point
  labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses Over Time",
       x = "Year", y = "Scaled Values") +
  theme_minimal() +                # Clean theme
  scale_color_manual(values = c("red", "pink", "black")) + # Custom colors
  theme(legend.title = element_blank()) + # Remove legend title
  theme(panel.grid.major = element_line(color = "gray", size = 0.5)) # Add gridlines
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Figure(1) Line plot showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2013. Sources: FBI, CDC, Google Trends “Taylor Swift”
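The `scale_0_100` helper used for these plots is plain min-max normalization; a minimal Python sketch of the same idea (hypothetical input values):

```python
def scale_0_100(values):
    """Min-max scale a sequence so its minimum maps to 0 and its maximum maps to 100."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * 100 for v in values]

print(scale_0_100([2.0, 6.0, 10.0]))  # [0.0, 50.0, 100.0]
```

As with the R version, this assumes the column is not constant; a constant column would divide by zero.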

#Same comparison but visualized with grouped bar chart
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the grouped bar plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
  labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses by Year",
       x = "Year", 
       y = "Scaled Values") +
  scale_fill_manual(values = c("FBI Violent Crime" = "red", 
                               "Taylor Swift Popularity" = "pink", 
                               "Drug Overdoses" = "black"),
                    name = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 100)) +
  geom_text(aes(label = round(value, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3)

Figure(2) Grouped bar chart showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2013. Sources: FBI, CDC, Google Trends “Taylor Swift”

# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Select only the columns for 'Taylor Swift Popularity' and 'Drug Overdoses'
scaled_data_subset <- scaled_data %>%
  select(year, `Taylor Swift Popularity`, `Drug Overdoses`)
# Step 3: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data_subset %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 4: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
  labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses by Year",
       x = "Year", 
       y = "Scaled Values") +
  scale_fill_manual(values = c("Taylor Swift Popularity" = "pink", 
                               "Drug Overdoses" = "black"),
                    name = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 100)) +
  geom_text(aes(label = round(value, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3)

Figure(3) Grouped bar chart showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2013. Sources: CDC, Google Trends “Taylor Swift”

Here we see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.

#Show correlation between drug overdose and popularity with a scatterplot/trendline
# Create the plot
ggplot(scaled_data, aes(x = year)) +
  # Scatter plot for Taylor Swift Popularity
  geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
  # Scatter plot for Drug Overdoses
  geom_point(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), size = 3, shape = 15) +
  # Fitted line for Taylor Swift Popularity
  geom_smooth(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), 
              method = "lm", se = FALSE, linetype = "dashed") +
  # Fitted line for Drug Overdoses
  geom_smooth(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), 
              method = "lm", se = FALSE, linetype = "dashed") +
  # Customize colors
  scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "Drug Overdoses" = "black")) +
  # Labels and title
  labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses Over Time",
       x = "Year",
       y = "Scaled Value (0 to 100)",
       color = "") +
  # Customize the theme
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.major = element_line(color = "gray", linetype = "dotted"),
        axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  # Set y-axis limits
  scale_y_continuous(limits = c(0, 100)) +
  # Set x-axis to show all years
  scale_x_continuous(breaks = scaled_data$year)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_smooth()`).

Figure(4) Scatterplot with trendlines showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2013. Sources: CDC, Google Trends “Taylor Swift”

We again see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America.

# Create the plot
ggplot(scaled_data, aes(x = year)) +
  # Line and points for Taylor Swift Popularity
  geom_line(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 1) +
  geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
  # Line and points for Violent Crime Rate
  geom_line(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 1) +
  geom_point(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 3, shape = 15) +
  # Customize colors
  scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "FBI Violent Crime" = "red")) +
  # Labels and title
  labs(title = "Scaled Values of Taylor Swift Popularity and FBI Violent Crime Over Time",
       x = "Year",
       y = "Scaled Value (0 to 100)",
       color = "") +
  # Customize the theme
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid = element_blank(),  # Remove all grid lines
        axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  # Set y-axis limits
  scale_y_continuous(limits = c(0, 100)) +
  # Set x-axis to show all years
  scale_x_continuous(breaks = scaled_data$year)

Figure(5) Line plot showing FBI Violent Crime and Taylor Swift’s Popularity from 2004-2013. Sources: FBI, Google Trends “Taylor Swift”

Shockingly, Ms. Swift’s music seems to have a calming effect on American citizens. Perhaps Taylor is sending Americans into a drug-ridden stupor.

Altogether, this work provides unequivocal evidence of Taylor Swift’s corruption of America. Ms. Swift’s music lulls the masses into a drug-ridden complacency.

Future work should investigate the negative correlation between Ms. Swift’s music and the violent crime rate in America. One plausible explanation is that Ms. Swift’s music triggers downregulation of testosterone production in males, leading to reduced violent crime.
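The correlations asserted above can be quantified from the master table's printed values (transcribed here, rounded as displayed); this Python sketch computes Pearson's r from scratch:

```python
# Values transcribed from the 2004-2013 master table above (rounded as printed)
popularity = [0, 0.5, 15, 52, 122, 265, 220, 194, 235, 220]
overdoses = [27424, 29813, 34425, 36010, 36450, 37004, 38329, 41340, 41502, 43982]
crime_rate = [463, 469, 479, 472, 459, 432, 404, 387, 388, 369]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson(popularity, overdoses), 2))   # strongly positive
print(round(pearson(popularity, crime_rate), 2))  # strongly negative
```

Correlation, of course, says nothing about causation — which is rather the point of this project.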

library(DBI)
library(RSQLite)
library(readxl)
# Create a connection to an SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Load Taylor Swift Popularity Data
Taylor_Interest <- read.csv("Taylor_Interest_Over_Time.csv", skip = 1)
dbWriteTable(con, "Taylor_Interest", Taylor_Interest)

# Load FBI Violent Crime Data
FBI_Crime <- read_excel("FBI_violent_crime_per_year.xls", skip = 3)
New names:
• `` -> `...23`
• `` -> `...24`
dbWriteTable(con, "FBI_Crime", FBI_Crime)

# Load Drug Overdose Data
Overdose_Data <- read_excel("Overdose table.xlsx", skip = 2)
New names:
• `Number of deaths` -> `Number of deaths...2`
• `Deaths per 100,000` -> `Deaths per 100,000...3`
• `` -> `...4`
• `` -> `...5`
• `Number of deaths` -> `Number of deaths...6`
• `` -> `...7`
• `Deaths per 100,000` -> `Deaths per 100,000...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `Number of deaths` -> `Number of deaths...12`
• `` -> `...13`
• `Deaths per 100,000` -> `Deaths per 100,000...14`
dbWriteTable(con, "Overdose_Data", Overdose_Data)
-- Get the Raw Taylor Swift Google Trends data
SELECT * FROM Taylor_Interest
Displaying records 1 - 10
Month taylor.swift…United.States.
2004-01 0
2004-02 0
2004-03 0
2004-04 0
2004-05 0
2004-06 0
2004-07 0
2004-08 0
2004-09 0
2004-10 0
-- Separate wider to get year and month columns. Handle '<1' values with a 0.5 approximation for calculations
CREATE TABLE Taylor_Interest_Transformed AS
SELECT 
    CAST(substr(Month, 1, 4) AS INTEGER) AS year,
    CAST(substr(Month, 6, 2) AS INTEGER) AS Month_new,
    CASE 
        WHEN `taylor.swift...United.States.` = '<1' THEN 0.5
        ELSE CAST(`taylor.swift...United.States.` AS REAL)
    END AS Popularity
FROM 
    Taylor_Interest;
-- Get the changed Taylor Swift Google Trends data
SELECT * FROM Taylor_Interest_Transformed 
Displaying records 1 - 10
year Month_new Popularity
2004 1 0
2004 2 0
2004 3 0
2004 4 0
2004 5 0
2004 6 0
2004 7 0
2004 8 0
2004 9 0
2004 10 0
-- Aggregate the totals after grouping by year
SELECT 
    year,
    SUM(Popularity) AS Total_Interest
FROM 
    Taylor_Interest_Transformed
GROUP BY 
    year
ORDER BY 
    year;
Displaying records 1 - 10
year Total_Interest
2004 0.0
2005 0.5
2006 15.0
2007 52.0
2008 122.0
2009 265.0
2010 220.0
2011 194.0
2012 235.0
2013 220.0
-- Create the table to store for later 
CREATE TABLE Taylor_Swift_Interest_Clean AS
SELECT 
        year,
        SUM(Popularity) AS total_count
    FROM 
        Taylor_Interest_Transformed
    GROUP BY 
        year
    ORDER BY 
        year;
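The same Trends transform can be replayed from Python's stdlib `sqlite3` module (hypothetical rows; the mangled column name follows the R-generated schema above):

```python
import sqlite3

# Build a tiny in-memory table shaped like the one R wrote to SQLite
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE Taylor_Interest (Month TEXT, "taylor.swift...United.States." TEXT)')
con.executemany("INSERT INTO Taylor_Interest VALUES (?, ?)",
                [("2004-12", "0"), ("2005-01", "<1"), ("2005-02", "1")])

# Same substr/CASE/GROUP BY pipeline as the SQL chunks above, in one query
rows = con.execute("""
    SELECT CAST(substr(Month, 1, 4) AS INTEGER) AS year,
           SUM(CASE WHEN "taylor.swift...United.States." = '<1' THEN 0.5
                    ELSE CAST("taylor.swift...United.States." AS REAL) END) AS total_count
    FROM Taylor_Interest
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)  # [(2004, 0.0), (2005, 1.5)]
```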
-- Get the FBI data
SELECT * FROM FBI_crime

Table: Displaying records 1 - 10 (24 columns; key columns shown below — the Murder, Rape, Robbery, Aggravated assault, Property crime, Burglary, Larceny-theft, and Motor vehicle theft counts and rates are omitted for readability)

|Year  | Population1| Violent crime2| Violent crime rate|
|:-----|-----------:|--------------:|------------------:|
|2000  |   281421906|        1425486|              506.5|
|20015 |   285317559|        1439480|              504.5|
|2002  |   287973924|        1423677|              494.4|
|2003  |   290788976|        1383676|              475.8|
|2004  |   293656842|        1360088|              463.2|
|2005  |   296507061|        1390745|              469.0|
|2006  |   299398484|        1435123|              479.3|
|2007  |   301621157|        1422970|              471.8|
|2008  |   304059724|        1394461|              458.6|
|2009  |   307006550|        1325896|              431.9|

-- View the data structure. The column names contain hidden newline characters, so you'll need this to wrangle the data correctly
PRAGMA table_info(FBI_crime);
Displaying records 1 - 10 (embedded newlines in column names shown as \n)

|cid |name                                            |type |notnull |dflt_value |pk |
|---:|:-----------------------------------------------|:----|-------:|:----------|--:|
|0   |Year                                            |TEXT |0       |NA         |0  |
|1   |Population1                                     |REAL |0       |NA         |0  |
|2   |Violent\ncrime2                                 |REAL |0       |NA         |0  |
|3   |Violent \ncrime \nrate                          |REAL |0       |NA         |0  |
|4   |Murder and\nnonnegligent \nmanslaughter         |REAL |0       |NA         |0  |
|5   |Murder and \nnonnegligent \nmanslaughter \nrate |REAL |0       |NA         |0  |
|6   |Rape\n(revised \ndefinition)3                   |REAL |0       |NA         |0  |
|7   |Rape\n(revised \ndefinition) \nrate3            |REAL |0       |NA         |0  |
|8   |Rape\n(legacy \ndefinition)4                    |REAL |0       |NA         |0  |
|9   |Rape\n(legacy \ndefinition) \nrate4             |REAL |0       |NA         |0  |

-- Extract the data of interest (Year & Crime Rate)
SELECT 
    Year AS year,
    `Violent 
crime 
rate` AS crime_rate
FROM 
    FBI_crime;
Displaying records 1 - 10
year crime_rate
2000 506.5
20015 504.5
2002 494.4
2003 475.8
2004 463.2
2005 469.0
2006 479.3
2007 471.8
2008 458.6
2009 431.9
-- Extract the data of interest (Year & Crime Rate) and number the rows so the last 7 rows, which are just footnotes, can be removed.
WITH numbered_rows AS (
    SELECT 
        Year AS year,
        `Violent 
crime 
rate` AS crime_rate,
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num
    FROM FBI_crime
    WHERE CAST(Year AS INTEGER) IS NOT NULL
)
SELECT year, crime_rate
FROM numbered_rows
WHERE row_num <= (SELECT COUNT(*) FROM numbered_rows) - 7
Displaying records 1 - 10
year crime_rate
2000 506.5
20015 504.5
2002 494.4
2003 475.8
2004 463.2
2005 469.0
2006 479.3
2007 471.8
2008 458.6
2009 431.9
-- Create the clean FBI table and save it for a later join; fix the year typos introduced by the footnote markers
CREATE TABLE FBI_Crime_Clean AS
WITH numbered_rows AS (
    SELECT 
        CASE 
            WHEN Year = 20015 THEN 2001
            WHEN Year = 20186 THEN 2018
            ELSE Year
        END AS year,
        `Violent 
crime 
rate` AS crime_rate,
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num
    FROM FBI_crime
    WHERE CAST(Year AS INTEGER) IS NOT NULL
)
SELECT year, crime_rate
FROM numbered_rows
WHERE row_num <= (SELECT COUNT(*) FROM numbered_rows) - 7
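One subtlety worth flagging: in SQLite, `CAST` of non-numeric text returns 0 rather than NULL, so the `WHERE CAST(Year AS INTEGER) IS NOT NULL` guard does not filter the footnote rows by itself; the `ROW_NUMBER()` trim does the real work. A quick check via Python's built-in `sqlite3`:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# In SQLite, CAST of non-numeric text yields 0, never NULL, so
# "WHERE CAST(Year AS INTEGER) IS NOT NULL" keeps footnote rows too.
(footnote_cast,) = con.execute(
    "SELECT CAST('NOTE: arson data' AS INTEGER)"
).fetchone()
print(footnote_cast)  # 0 -- still NOT NULL, so the row survives the WHERE

# CAST parses a leading numeric prefix, which is why the '20015' footnote
# typo needs the explicit CASE fix rather than a cast.
(typo_cast,) = con.execute("SELECT CAST('20015' AS INTEGER)").fetchone()
print(typo_cast)  # 20015
```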

Wrangled FBI Crime data ready for use.

-- Get the CDC data
SELECT * FROM Overdose_Data
Displaying records 1 - 10
Year Number of deaths…2 Deaths per 100,000…3 …4 …5 Number of deaths…6 …7 Deaths per 100,000…8 …9 …10 …11 Number of deaths…12 …13 Deaths per 100,000…14
2001. . . . . . . . 19394 6.8 NA NA 12658 NA 9.0 NA NA NA 6736 NA 4.6
2002. . . . . . . . 23518 8.2 NA NA 15028 NA 10.6 NA NA NA 8490 NA 5.8
2003. . . . . . . . 25785 8.9 NA NA 16399 NA 11.5 NA NA NA 9386 NA 6.4
2004. . . . . . . . 27424 9.4 NA NA 17120 NA 11.8 NA NA NA 10304 NA 6.9
2005. . . . . . . . 29813 10.1 NA NA 18724 NA 12.8 NA NA NA 11089 NA 7.3
2006. . . . . . . . 34425 11.5 NA NA 21893 NA 14.8 NA NA NA 12532 NA 8.2
2007. . . . . . . . 36010 11.9 NA NA 22298 NA 14.9 NA NA NA 13712 NA 8.8
2008. . . . . . . . 36450 11.9 NA NA 22468 NA 14.9 NA NA NA 13982 NA 8.9
2009. . . . . . . . 37004 11.9 NA NA 22593 NA 14.8 NA NA NA 14411 NA 9.1
2010. . . . . . . . 38329 12.3 NA NA 23006 NA 15.0 NA NA NA 15323 NA 9.6
-- Remove the "....." formatting of the raw data
UPDATE Overdose_Data
SET Year = CAST(REPLACE(REPLACE(REPLACE(Year, '.', ''), ' ', ''), '-', '') AS INTEGER);
-- Get the data
SELECT * FROM Overdose_Data
Displaying records 1 - 10
Year Number of deaths…2 Deaths per 100,000…3 …4 …5 Number of deaths…6 …7 Deaths per 100,000…8 …9 …10 …11 Number of deaths…12 …13 Deaths per 100,000…14
2001 19394 6.8 NA NA 12658 NA 9.0 NA NA NA 6736 NA 4.6
2002 23518 8.2 NA NA 15028 NA 10.6 NA NA NA 8490 NA 5.8
2003 25785 8.9 NA NA 16399 NA 11.5 NA NA NA 9386 NA 6.4
2004 27424 9.4 NA NA 17120 NA 11.8 NA NA NA 10304 NA 6.9
2005 29813 10.1 NA NA 18724 NA 12.8 NA NA NA 11089 NA 7.3
2006 34425 11.5 NA NA 21893 NA 14.8 NA NA NA 12532 NA 8.2
2007 36010 11.9 NA NA 22298 NA 14.9 NA NA NA 13712 NA 8.8
2008 36450 11.9 NA NA 22468 NA 14.9 NA NA NA 13982 NA 8.9
2009 37004 11.9 NA NA 22593 NA 14.8 NA NA NA 14411 NA 9.1
2010 38329 12.3 NA NA 23006 NA 15.0 NA NA NA 15323 NA 9.6
-- Rename Year to lowercase year for the subsequent join
ALTER TABLE Overdose_Data
RENAME COLUMN Year TO year;
-- View the data
SELECT *
FROM Overdose_Data
LIMIT 5;
5 records
year Number of deaths…2 Deaths per 100,000…3 …4 …5 Number of deaths…6 …7 Deaths per 100,000…8 …9 …10 …11 Number of deaths…12 …13 Deaths per 100,000…14
2001 19394 6.8 NA NA 12658 NA 9.0 NA NA NA 6736 NA 4.6
2002 23518 8.2 NA NA 15028 NA 10.6 NA NA NA 8490 NA 5.8
2003 25785 8.9 NA NA 16399 NA 11.5 NA NA NA 9386 NA 6.4
2004 27424 9.4 NA NA 17120 NA 11.8 NA NA NA 10304 NA 6.9
2005 29813 10.1 NA NA 18724 NA 12.8 NA NA NA 11089 NA 7.3
-- Get the data of interest and create a table
CREATE TABLE Overdose_Data_Clean AS
SELECT 
    year,
    `Number of deaths...2` AS number_of_deaths
FROM Overdose_Data;
-- Get the wrangled CDC data
SELECT * FROM Overdose_Data_Clean
Displaying records 1 - 10
year number_of_deaths
2001 19394
2002 23518
2003 25785
2004 27424
2005 29813
2006 34425
2007 36010
2008 36450
2009 37004
2010 38329
-- Join the CDC overdose data and FBI violent crime data
SELECT 
    o.year,
    o.number_of_deaths AS overdose_deaths,
    f.crime_rate
FROM 
    Overdose_Data_Clean o
INNER JOIN 
    FBI_Crime_Clean f
ON 
    o.year = f.year;
Displaying records 1 - 10
year overdose_deaths crime_rate
2002 23518 494.4
2003 25785 475.8
2004 27424 463.2
2005 29813 469.0
2006 34425 479.3
2007 36010 471.8
2008 36450 458.6
2009 37004 431.9
2010 38329 404.5
2011 41340 387.1
-- Join all 3 data sets and create the master table for analysis
CREATE TABLE FBI_Drugs_TaylorSwift AS
SELECT 
    o.year,
    o.number_of_deaths AS overdose_deaths,
    f.crime_rate,
    t.total_count AS taylor_swift_interest
FROM 
    Overdose_Data_Clean o
INNER JOIN 
    FBI_Crime_Clean f ON o.year = f.year
INNER JOIN 
    Taylor_Swift_Interest_Clean t ON o.year = t.year;
-- Show the master table
SELECT * FROM FBI_Drugs_TaylorSwift
Displaying records 1 - 10
year overdose_deaths crime_rate taylor_swift_interest
2004 27424 463.2 0.0
2005 29813 469.0 0.5
2006 34425 479.3 15.0
2007 36010 471.8 52.0
2008 36450 458.6 122.0
2009 37004 431.9 265.0
2010 38329 404.5 220.0
2011 41340 387.1 194.0
2012 41502 387.8 235.0
2013 43982 369.1 220.0
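The chained INNER JOINs keep only years present in all three sources, which is why the master table starts at 2004 (the Trends series begins there). A toy reproduction of that intersection behavior with small subsets of the three tables, via Python's `sqlite3`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy tables: each source covers a different span of years (values
# taken from the tables above).
cur.execute("CREATE TABLE overdoses (year INT, deaths INT)")
cur.execute("CREATE TABLE crime (year INT, rate REAL)")
cur.execute("CREATE TABLE trends (year INT, interest REAL)")
cur.executemany("INSERT INTO overdoses VALUES (?, ?)",
                [(2001, 19394), (2004, 27424), (2005, 29813)])
cur.executemany("INSERT INTO crime VALUES (?, ?)",
                [(2000, 506.5), (2004, 463.2), (2005, 469.0)])
cur.executemany("INSERT INTO trends VALUES (?, ?)",
                [(2004, 0.0), (2005, 0.5)])

# Chained INNER JOINs keep only years present in all three sources:
# 2000 and 2001 drop out because Trends coverage starts in 2004.
rows = cur.execute("""
    SELECT o.year, o.deaths, c.rate, t.interest
    FROM overdoses o
    JOIN crime c ON o.year = c.year
    JOIN trends t ON o.year = t.year
    ORDER BY o.year
""").fetchall()
print(rows)  # [(2004, 27424, 463.2, 0.0), (2005, 29813, 469.0, 0.5)]
```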

Combined data of interest: Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity, 2004-2013 (first 10 records shown). Sources: FBI, CDC, Google Trends “Taylor Swift”

import pandas as pd

Taylor_Interest_Over_Time = pd.read_csv("Taylor_Interest_Over_Time.csv", skiprows=1)
print(Taylor_Interest_Over_Time)
       Month taylor swift: (United States)
0    2004-01                             0
1    2004-02                             0
2    2004-03                             0
3    2004-04                             0
4    2004-05                             0
..       ...                           ...
245  2024-06                            36
246  2024-07                            31
247  2024-08                            31
248  2024-09                            40
249  2024-10                            27

[250 rows x 2 columns]
import pandas as pd

# Transform the data to be more usable
result = Taylor_Interest_Over_Time.copy()

# Split the "Month" column into year and month
result[['year', 'Month_new']] = result['Month'].str.split('-', expand=True)

# Rename the popularity metric column
result = result.rename(columns={'taylor swift: (United States)': 'Popularity'})

# Convert "<1" to 0.5 in the Popularity column
result['Popularity'] = result['Popularity'].replace('<1', '0.5')
result['Popularity'] = pd.to_numeric(result['Popularity'])

# Convert year to numeric type
result['year'] = pd.to_numeric(result['year'])

print(result)
       Month  Popularity  year Month_new
0    2004-01         0.0  2004        01
1    2004-02         0.0  2004        02
2    2004-03         0.0  2004        03
3    2004-04         0.0  2004        04
4    2004-05         0.0  2004        05
..       ...         ...   ...       ...
245  2024-06        36.0  2024        06
246  2024-07        31.0  2024        07
247  2024-08        31.0  2024        08
248  2024-09        40.0  2024        09
249  2024-10        27.0  2024        10

[250 rows x 4 columns]
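For what it's worth, the split/rename/replace steps above can be collapsed with `pd.to_datetime`; a sketch on a toy frame shaped like the Trends export (values illustrative):

```python
import pandas as pd

# Toy frame mimicking the Google Trends export (hypothetical values).
trends = pd.DataFrame({
    "Month": ["2004-01", "2004-02", "2024-10"],
    "taylor swift: (United States)": ["0", "<1", "27"],
})

out = trends.rename(columns={"taylor swift: (United States)": "Popularity"})
# "<1" means a non-zero value below 1; 0.5 is the midpoint convention used above.
out["Popularity"] = pd.to_numeric(out["Popularity"].replace("<1", "0.5"))
# to_datetime parses "YYYY-MM" directly; .dt.year replaces the manual split.
out["year"] = pd.to_datetime(out["Month"], format="%Y-%m").dt.year
print(out)
```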
import pandas as pd

# Aggregate Taylor Swift monthly data to yearly sums
Taylor_swift_popularity_yearly_sums = result.groupby('year')['Popularity'].sum().reset_index()

# Arrange by year
Taylor_swift_popularity_yearly_sums = Taylor_swift_popularity_yearly_sums.sort_values(by='year')

# Print the result
print(Taylor_swift_popularity_yearly_sums)
    year  Popularity
0   2004         0.0
1   2005         0.5
2   2006        15.0
3   2007        52.0
4   2008       122.0
5   2009       265.0
6   2010       220.0
7   2011       194.0
8   2012       235.0
9   2013       220.0
10  2014       203.0
11  2015       275.0
12  2016       181.0
13  2017       150.0
14  2018       106.0
15  2019       126.0
16  2020       102.0
17  2021       115.0
18  2022       173.0
19  2023       564.0
20  2024       465.0
# Generate a "synonym" by converting "2004"-style year values to "Two Thousand and Four"
'Two Thousand and Four'
# ChatGPT was used with the prompt "I want to convert the year column from displaying "2001" to "two thousand and one" for each year" to generate the following function

import pandas as pd

def number_to_words(num):
    ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]

    if num == 2000:
        return "two thousand"

    thousands = num // 1000
    remainder = num % 1000

    result = ""
    if thousands > 0:
        result += ones[thousands] + " thousand"

    if remainder >= 100:
        hundreds = remainder // 100
        result += " " + ones[hundreds] + " hundred"
        remainder %= 100

    if remainder > 0:
        if len(result) > 0:
            result += " and"
        if remainder < 20:
            if remainder < 10:
                result += " " + ones[remainder]
            else:
                result += " " + teens[remainder - 10]
        else:
            result += " " + tens[remainder // 10]
            if remainder % 10 != 0:
                result += " " + ones[remainder % 10]

    return result.strip()


Taylor_swift_popularity_yearly_sums['year'] = Taylor_swift_popularity_yearly_sums['year'].apply(number_to_words)

# Capitalize the first letter of each word
Taylor_swift_popularity_yearly_sums['year'] = Taylor_swift_popularity_yearly_sums['year'].apply(lambda x: ' '.join(word.capitalize() for word in x.split()))
print(Taylor_swift_popularity_yearly_sums)
                             year  Popularity
0           Two Thousand And Four         0.0
1           Two Thousand And Five         0.5
2            Two Thousand And Six        15.0
3          Two Thousand And Seven        52.0
4          Two Thousand And Eight       122.0
5           Two Thousand And Nine       265.0
6            Two Thousand And Ten       220.0
7         Two Thousand And Eleven       194.0
8         Two Thousand And Twelve       235.0
9       Two Thousand And Thirteen       220.0
10      Two Thousand And Fourteen       203.0
11       Two Thousand And Fifteen       275.0
12       Two Thousand And Sixteen       181.0
13     Two Thousand And Seventeen       150.0
14      Two Thousand And Eighteen       106.0
15      Two Thousand And Nineteen       126.0
16        Two Thousand And Twenty       102.0
17    Two Thousand And Twenty One       115.0
18    Two Thousand And Twenty Two       173.0
19  Two Thousand And Twenty Three       564.0
20   Two Thousand And Twenty Four       465.0
# Convert the synonym back to numeric form to align with the year column in our other data sources
import pandas as pd

# Define the mapping dictionary with the correct format
year_mapping = {
    "Two Thousand And Four": 2004,
    "Two Thousand And Five": 2005,
    "Two Thousand And Six": 2006,
    "Two Thousand And Seven": 2007,
    "Two Thousand And Eight": 2008,
    "Two Thousand And Nine": 2009,
    "Two Thousand And Ten": 2010,
    "Two Thousand And Eleven": 2011,
    "Two Thousand And Twelve": 2012,
    "Two Thousand And Thirteen": 2013,
    "Two Thousand And Fourteen": 2014,
    "Two Thousand And Fifteen": 2015,
    "Two Thousand And Sixteen": 2016,
    "Two Thousand And Seventeen": 2017,
    "Two Thousand And Eighteen": 2018,
    "Two Thousand And Nineteen": 2019,
    "Two Thousand And Twenty": 2020,
    "Two Thousand And Twenty One": 2021,
    "Two Thousand And Twenty Two": 2022,
    "Two Thousand And Twenty Three": 2023,
    "Two Thousand And Twenty Four": 2024
}

# Replace the 'year' column with numeric values
Taylor_swift_popularity_yearly_sums["year"] = Taylor_swift_popularity_yearly_sums["year"].map(year_mapping)
print(Taylor_swift_popularity_yearly_sums)
    year  Popularity
0   2004         0.0
1   2005         0.5
2   2006        15.0
3   2007        52.0
4   2008       122.0
5   2009       265.0
6   2010       220.0
7   2011       194.0
8   2012       235.0
9   2013       220.0
10  2014       203.0
11  2015       275.0
12  2016       181.0
13  2017       150.0
14  2018       106.0
15  2019       126.0
16  2020       102.0
17  2021       115.0
18  2022       173.0
19  2023       564.0
20  2024       465.0
import pandas as pd

# Load FBI Violent Crime Data
FBI_violent_crime_per_year = pd.read_excel("FBI_violent_crime_per_year.xls", skiprows=3)

# Print the result
print(FBI_violent_crime_per_year)
                                                 Year  ...  Unnamed: 23
0                                                2000  ...          NaN
1                                               20015  ...          NaN
2                                                2002  ...          NaN
3                                                2003  ...          NaN
4                                                2004  ...          NaN
5                                                2005  ...          NaN
6                                                2006  ...             
7                                                2007  ...          NaN
8                                                2008  ...          NaN
9                                                2009  ...          NaN
10                                               2010  ...          NaN
11                                               2011  ...          NaN
12                                               2012  ...          NaN
13                                               2013  ...          NaN
14                                               2014  ...          NaN
15                                               2015  ...          NaN
16                                               2016  ...          NaN
17                                               2017  ...          NaN
18                                              20186  ...          NaN
19                                               2019  ...          NaN
20  1 Populations are U.S. Census Bureau provision...  ...          NaN
21  2 The violent crime figures include the offens...  ...          NaN
22  3 The figures shown in this column for the off...  ...          NaN
23  4 The figures shown in this column for the off...  ...          NaN
24  5 The murder and nonnegligent homicides that o...  ...          NaN
25            6 The crime figures have been adjusted.  ...          NaN
26  NOTE:  Although arson data are included in the...  ...          NaN

[27 rows x 24 columns]
# Get the column names
columns = FBI_violent_crime_per_year.columns

# Find the 'Year' column (case-insensitive)
year_column = [col for col in columns if 'year' in col.lower()][0]

# Find the 'Violent crime rate' column (case-insensitive and ignoring newlines)
crime_rate_column = [col for col in columns if 'violent' in col.lower() and 'crime' in col.lower() and 'rate' in col.lower()][0]

# Select crime rate and year columns and rename them
new_data_frame = FBI_violent_crime_per_year[[year_column, crime_rate_column]].copy()
new_data_frame = new_data_frame.rename(columns={year_column: "year", crime_rate_column: "crime_rate"})

# Print the result
print(new_data_frame)
                                                 year  crime_rate
0                                                2000       506.5
1                                               20015       504.5
2                                                2002       494.4
3                                                2003       475.8
4                                                2004       463.2
5                                                2005       469.0
6                                                2006       479.3
7                                                2007       471.8
8                                                2008       458.6
9                                                2009       431.9
10                                               2010       404.5
11                                               2011       387.1
12                                               2012       387.8
13                                               2013       369.1
14                                               2014       361.6
15                                               2015       373.7
16                                               2016       386.6
17                                               2017       383.8
18                                              20186       370.4
19                                               2019       366.7
20  1 Populations are U.S. Census Bureau provision...         NaN
21  2 The violent crime figures include the offens...         NaN
22  3 The figures shown in this column for the off...         NaN
23  4 The figures shown in this column for the off...         NaN
24  5 The murder and nonnegligent homicides that o...         NaN
25            6 The crime figures have been adjusted.         NaN
26  NOTE:  Although arson data are included in the...         NaN
# Drop the last 7 rows as these are footnotes 
FBI_Crime_Rate = new_data_frame.iloc[:-7].copy()

# Fix the typos in the year column introduced by the footnotes. Note: the superscript
# footnote markers make these cells text, so the integer keys below never match the
# string values -- 20015 and 20186 persist, and 2018 is later dropped by the inner join.
FBI_Crime_Rate['year'] = FBI_Crime_Rate['year'].replace({20186: 2018, 20015: 2001})

# Convert year to integer type
FBI_Crime_Rate['year'] = FBI_Crime_Rate['year'].astype(int)

# Print the result
print(FBI_Crime_Rate)
     year  crime_rate
0    2000       506.5
1   20015       504.5
2    2002       494.4
3    2003       475.8
4    2004       463.2
5    2005       469.0
6    2006       479.3
7    2007       471.8
8    2008       458.6
9    2009       431.9
10   2010       404.5
11   2011       387.1
12   2012       387.8
13   2013       369.1
14   2014       361.6
15   2015       373.7
16   2016       386.6
17   2017       383.8
18  20186       370.4
19   2019       366.7
# Convert the year column to numeric
FBI_Crime_Rate['year'] = pd.to_numeric(FBI_Crime_Rate['year'], errors='coerce')

# View the result and check the structure
print(FBI_Crime_Rate)
     year  crime_rate
0    2000       506.5
1   20015       504.5
2    2002       494.4
3    2003       475.8
4    2004       463.2
5    2005       469.0
6    2006       479.3
7    2007       471.8
8    2008       458.6
9    2009       431.9
10   2010       404.5
11   2011       387.1
12   2012       387.8
13   2013       369.1
14   2014       361.6
15   2015       373.7
16   2016       386.6
17   2017       383.8
18  20186       370.4
19   2019       366.7
# Perform an inner join and rename the crime-rate column in the combined dataframe.
# (The Trends column already arrives named 'Popularity', so only crime_rate needs renaming.)
combined_data = FBI_Crime_Rate.merge(Taylor_swift_popularity_yearly_sums, on='year', how='inner')
combined_data = combined_data.rename(columns={
    'crime_rate': 'FBI Violent Crime'
})

# View the result
print(combined_data)
    year  FBI Violent Crime  Popularity
0   2004              463.2         0.0
1   2005              469.0         0.5
2   2006              479.3        15.0
3   2007              471.8        52.0
4   2008              458.6       122.0
5   2009              431.9       265.0
6   2010              404.5       220.0
7   2011              387.1       194.0
8   2012              387.8       235.0
9   2013              369.1       220.0
10  2014              361.6       203.0
11  2015              373.7       275.0
12  2016              386.6       181.0
13  2017              383.8       150.0
14  2019              366.7       126.0
import pandas as pd

Overdose_table = pd.read_csv('overdose_table.csv')
print(Overdose_table)
                    year  number_of_deaths
0    2001. . . . . . . .             19394
1    2002. . . . . . . .             23518
2    2003. . . . . . . .             25785
3    2004. . . . . . . .             27424
4    2005. . . . . . . .             29813
5    2006. . . . . . . .             34425
6    2007. . . . . . . .             36010
7    2008. . . . . . . .             36450
8    2009. . . . . . . .             37004
9    2010. . . . . . . .             38329
10  2011 . . . . . . . .             41340
11   2012. . . . . . . .             41502
12   2013. . . . . . . .             43982
13   2014. . . . . . . .             47055
14   2015. . . . . . . .             52404
15   2016. . . . . . . .             63632
16   2017. . . . . . . .             70237
17   2018. . . . . . . .             67367
18   2019. . . . . . . .             70630
19   2020. . . . . . . .             91799
20   2021. . . . . . . .            106699
import pandas as pd
import re

# Remove "....." from year entries and convert to integer
Overdose_table['year'] = Overdose_table['year'].apply(lambda x: re.sub(r'[^\d]', '', str(x)))
Overdose_table['year'] = pd.to_numeric(Overdose_table['year'], errors='coerce').astype('Int64')

# Print the result
print(Overdose_table)
    year  number_of_deaths
0   2001             19394
1   2002             23518
2   2003             25785
3   2004             27424
4   2005             29813
5   2006             34425
6   2007             36010
7   2008             36450
8   2009             37004
9   2010             38329
10  2011             41340
11  2012             41502
12  2013             43982
13  2014             47055
14  2015             52404
15  2016             63632
16  2017             70237
17  2018             67367
18  2019             70630
19  2020             91799
20  2021            106699
# Select only Total deaths and Year columns
Overdose_table_subset = Overdose_table.iloc[:, :2].copy()

# Rename columns for clarity
Overdose_table_subset = Overdose_table_subset.rename(columns={Overdose_table_subset.columns[0]: 'year', Overdose_table_subset.columns[1]: 'number_of_deaths'})

# Display the first few rows of the new DataFrame
print(Overdose_table_subset.head())
   year  number_of_deaths
0  2001             19394
1  2002             23518
2  2003             25785
3  2004             27424
4  2005             29813
# Check the structure of the new DataFrame
print(Overdose_table_subset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              21 non-null     Int64
 1   number_of_deaths  21 non-null     int64
dtypes: Int64(1), int64(1)
memory usage: 485.0 bytes
None
import pandas as pd

# Join with the previous combined data and rename the deaths column
TaylorSwift_ViolentCrime_DrugOverdose = combined_data.merge(Overdose_table_subset, on='year', how='left')
TaylorSwift_ViolentCrime_DrugOverdose = TaylorSwift_ViolentCrime_DrugOverdose.rename(columns={'number_of_deaths': 'Drug Overdoses'})

# Display the first few rows of the joined dataframe
print(TaylorSwift_ViolentCrime_DrugOverdose)
    year  FBI Violent Crime  Popularity  Drug Overdoses
0   2004              463.2         0.0           27424
1   2005              469.0         0.5           29813
2   2006              479.3        15.0           34425
3   2007              471.8        52.0           36010
4   2008              458.6       122.0           36450
5   2009              431.9       265.0           37004
6   2010              404.5       220.0           38329
7   2011              387.1       194.0           41340
8   2012              387.8       235.0           41502
9   2013              369.1       220.0           43982
10  2014              361.6       203.0           47055
11  2015              373.7       275.0           52404
12  2016              386.6       181.0           63632
13  2017              383.8       150.0           70237
14  2019              366.7       126.0           70630

Combined data of interest: Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity, 2004-2019. 2018 is absent because the 20186 footnote typo was never repaired in this pipeline, so the inner join dropped that year. Sources: FBI, CDC, Google Trends “Taylor Swift”

import pandas as pd
import matplotlib.pyplot as plt

data = TaylorSwift_ViolentCrime_DrugOverdose
df = pd.DataFrame(data)

# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100

# Apply scaling to all columns except 'year'
scaled_data = df.copy()
for col in scaled_data.columns[1:]:
    scaled_data[col] = scale_0_100(scaled_data[col])
# Reshape the scaled data for plotting
scaled_data_long = scaled_data.melt(id_vars=['year'], var_name='variable', value_name='value')
# Create a time series line plot of the scaled values
plt.figure(figsize=(12, 6))

# Plot each metric's scaled value
for variable in scaled_data_long['variable'].unique():
    subset = scaled_data_long[scaled_data_long['variable'] == variable]
    plt.plot(subset['year'], subset['value'], marker='o', label=variable)

# Add labels and title
plt.title('Scaled Values of FBI Violent Crime,\nPopularity and Drug Overdoses (2004-2019)', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(scaled_data_long['year'].unique(), rotation=45)
plt.axhline(0, color='gray', linestyle='--', linewidth=1) # Add a baseline at y=0

# Add legend and grid
plt.legend(title='Metrics')
plt.grid(True)

# Show plot
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Define the data
data = TaylorSwift_ViolentCrime_DrugOverdose

# Create the DataFrame
df = pd.DataFrame(data)

# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100

# Apply scaling to all columns except 'year'
scaled_data = df.copy()
for col in scaled_data.columns[1:]:
    scaled_data[col] = scale_0_100(scaled_data[col])

# Reshape data for plotting
scaled_data_long = scaled_data.melt(id_vars=['year'], var_name='variable', value_name='value')

# Create a bar chart
plt.figure(figsize=(14, 7))

# Set bar width and positions
bar_width = 0.25
x_positions = np.arange(len(scaled_data['year']))

# Plot each variable as a separate set of bars
for i, variable in enumerate(scaled_data.columns[1:]):
    plt.bar(x_positions + i * bar_width,
            scaled_data[variable],
            width=bar_width,
            label=variable)

# Add labels and title
plt.title('Scaled Values of FBI Violent Crime,\nPopularity and Drug Overdoses (2004-2019)', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(x_positions + bar_width / len(scaled_data.columns[1:]), scaled_data['year'], rotation=45)
# Add legend and grid
plt.legend(title='Metrics')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Define the data
data = TaylorSwift_ViolentCrime_DrugOverdose

# Create the DataFrame
df = pd.DataFrame(data)

# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100

# Apply scaling to the columns
scaled_data = df.copy()
scaled_data["Popularity"] = scale_0_100(scaled_data["Popularity"])
scaled_data["Drug Overdoses"] = scale_0_100(scaled_data["Drug Overdoses"])

# Create a Scatter Plot
plt.figure(figsize=(12, 6))
plt.scatter(scaled_data["year"], scaled_data["Popularity"], label="Popularity", color='blue', marker='o')
plt.scatter(scaled_data["year"], scaled_data["Drug Overdoses"], label="Drug Overdoses", color='green', marker='x')

# Add trendlines
for variable in ["Popularity", "Drug Overdoses"]:
    z = np.polyfit(scaled_data["year"], scaled_data[variable], deg=1) # Linear fit
    p = np.poly1d(z)
    plt.plot(scaled_data["year"], p(scaled_data["year"]), label=f"{variable} Trendline", linestyle='--')

# Add labels and title
plt.title('Scaled Values Change Over Time for Taylor Swift Popularity and Drug Overdoses', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(scaled_data["year"], rotation=45)
plt.axhline(0, color='gray', linestyle='--', linewidth=1) # Add a baseline at y=0

# Add legend and grid
plt.legend(title='Metrics')
plt.grid(True)

# Show plot
plt.tight_layout()
plt.show()

Note: More analysis can be found under each code tab set; here we show only a summary.

Because our datasets have very different ranges, we must scale the data to show relative changes over time rather than plotting raw values. For example, the violent crime rate is on the order of hundreds, while drug overdose deaths are on the order of 100,000. If the data are not scaled, the temporal relationship between violent crime, drug use, and Taylor Swift's popularity is obscured, exactly as Taylor Swift wants. As data scientists with a firm grasp of mathematics, we are not so easily fooled.
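The min-max scaling described above can be sketched in Python as well, mirroring the `scale_0_100` helper used in the R chunks. This is a minimal illustration with made-up values on deliberately mismatched scales; the column names are only placeholders for our real datasets.

```python
import pandas as pd

def scale_0_100(x):
    """Min-max scale a numeric series to the 0-100 range, ignoring NaNs."""
    x = pd.Series(x, dtype="float64")
    lo, hi = x.min(skipna=True), x.max(skipna=True)
    return (x - lo) / (hi - lo) * 100

# Toy example: popularity in the tens, overdose deaths in the tens of thousands
df = pd.DataFrame({
    "year": [2004, 2005, 2006],
    "Popularity": [10, 55, 100],
    "Drug Overdoses": [27_000, 30_000, 34_000],
})
scaled = df.copy()
for col in ["Popularity", "Drug Overdoses"]:
    scaled[col] = scale_0_100(scaled[col])
print(scaled)
```

After scaling, both series occupy the same 0-100 range, so their relative trends can share one axis.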

#Scale all data to range from 0-100 for graphical visualization of relative trends per dataset
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = year, y = value, color = variable)) +
  geom_line(size = 1) +            # Line plot
  geom_point(size = 3) +           # Add points for each data point
  labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses Over Time",
       x = "Year", y = "Scaled Values") +
  theme_minimal() +                # Clean theme
  scale_color_manual(values = c("red", "pink", "black")) + # Custom colors
  theme(legend.title = element_blank()) + # Remove legend title
  theme(panel.grid.major = element_line(color = "gray", size = 0.5)) # Add gridlines

Figure(1) Line plot showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2024. Sources: FBI, CDC, Google Trends “Taylor Swift”

#Same comparison but visualized with grouped bar chart
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the grouped bar plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
  labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses by Year",
       x = "Year", 
       y = "Scaled Values") +
  scale_fill_manual(values = c("FBI Violent Crime" = "red", 
                               "Taylor Swift Popularity" = "pink", 
                               "Drug Overdoses" = "black"),
                    name = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 100)) +
  geom_text(aes(label = round(value, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3)

Figure(2) Grouped Bar Chart showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2024. Sources: FBI, CDC, Google Trends “Taylor Swift”

# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
  mutate(across(-year, scale_0_100))
# Step 2: Select only the columns for 'Taylor Swift Popularity' and 'Drug Overdoses'
scaled_data_subset <- scaled_data %>%
  select(year, `Taylor Swift Popularity`, `Drug Overdoses`)
# Step 3: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data_subset %>%
  pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 4: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
  labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses by Year",
       x = "Year", 
       y = "Scaled Values") +
  scale_fill_manual(values = c("Taylor Swift Popularity" = "pink", 
                               "Drug Overdoses" = "black"),
                    name = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 100)) +
  geom_text(aes(label = round(value, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3)

Figure(3) Grouped Bar Chart showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2024. Sources: CDC, Google Trends “Taylor Swift”

Here we see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.
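The year-by-year correlation claimed here can be quantified with a Pearson correlation coefficient. A minimal sketch using illustrative, made-up scaled values (not the project's actual numbers):

```python
import numpy as np

# Illustrative scaled 0-100 series (placeholder values, not the real data)
popularity = np.array([5, 20, 35, 50, 65, 80, 100], dtype=float)
overdoses = np.array([10, 18, 30, 48, 60, 85, 95], dtype=float)

# Pearson correlation coefficient from the 2x2 correlation matrix
r = np.corrcoef(popularity, overdoses)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r near +1 indicates the two series rise together, though, as always, correlation is not causation (Ms. Swift's lawyers will be relieved).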

#Show correlation between drug overdose and popularity with a scatterplot/trendline
# Create the plot
ggplot(scaled_data, aes(x = year)) +
  # Scatter plot for Taylor Swift Popularity
  geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
  # Scatter plot for Drug Overdoses
  geom_point(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), size = 3, shape = 15) +
  # Fitted line for Taylor Swift Popularity
  geom_smooth(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), 
              method = "lm", se = FALSE, linetype = "dashed") +
  # Fitted line for Drug Overdoses
  geom_smooth(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), 
              method = "lm", se = FALSE, linetype = "dashed") +
  # Customize colors
  scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "Drug Overdoses" = "black")) +
  # Labels and title
  labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses Over Time",
       x = "Year",
       y = "Scaled Value (0 to 100)",
       color = "") +
  # Customize the theme
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.major = element_line(color = "gray", linetype = "dotted"),
        axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  # Set y-axis limits
  scale_y_continuous(limits = c(0, 100)) +
  # Set x-axis to show all years
  scale_x_continuous(breaks = scaled_data$year)

Figure(4) Scatterplot with trendlines showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2024. Sources: CDC, Google Trends “Taylor Swift”

We again see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.

# Create the plot
ggplot(scaled_data, aes(x = year)) +
  # Line and points for Taylor Swift Popularity
  geom_line(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 1) +
  geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
  # Line and points for Violent Crime Rate
  geom_line(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 1) +
  geom_point(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 3, shape = 15) +
  # Customize colors
  scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "FBI Violent Crime" = "red")) +
  # Labels and title
  labs(title = "Scaled Values of Taylor Swift Popularity and FBI Violent Crime Over Time",
       x = "Year",
       y = "Scaled Value (0 to 100)",
       color = "") +
  # Customize the theme
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid = element_blank(),  # Remove all grid lines
        axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  # Set y-axis limits
  scale_y_continuous(limits = c(0, 100)) +
  # Set x-axis to show all years
  scale_x_continuous(breaks = scaled_data$year)

Figure(5) Line plot showing Violent Crime and Taylor Swift’s Popularity from 2004-2024. Sources: FBI, Google Trends “Taylor Swift”

Shockingly, Ms. Swift’s music seems to have a calming effect on American citizens. Perhaps Taylor is sending Americans into a drug-ridden stupor.

Altogether, this work provides unequivocal evidence of Taylor Swift’s corruption of America. Ms. Swift’s music lulls the masses into a drug-ridden complacency.

Future work should investigate the negative correlation between Ms. Swift’s music and the violent crime rate in America. One plausible explanation is that Ms. Swift’s music triggers downregulation of testosterone production in males, leading to reduced violent crime.

Most of the challenges with the wrangling were introduced by raw-data formatting issues. For example, footnote markers in the FBI crime data introduced stray digits that had to be removed with a mutation. Additionally, the FBI data had invisible characters in the column names, which made it confusing to select or extract specific columns. The Google Trends data required separating wider, grouping by year, and aggregating to yield usable data. The CDC overdose data needed extraction of only specific columns.
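The footnote and invisible-character cleanup described above can be sketched in Python as follows. The raw frame here is hypothetical (a fused footnote digit and a non-breaking space standing in for the real FBI formatting quirks), not the actual FBI file.

```python
import re
import pandas as pd

# Hypothetical raw FBI-style data: a footnote marker fused to a value,
# and a non-breaking space (\u00a0) hiding in a column name
raw = pd.DataFrame({
    "year": ["2004", "2005"],
    "Violent\u00a0crime rate": ["463.2 1", "469.0"],  # trailing "1" is a footnote
})

# Normalize invisible characters in column names to plain spaces
raw.columns = [re.sub(r"[\u00a0\u200b]", " ", c).strip() for c in raw.columns]

# Strip trailing footnote digits, then coerce the column to numeric
raw["Violent crime rate"] = pd.to_numeric(
    raw["Violent crime rate"].str.replace(r"\s+\d+$", "", regex=True)
)
print(raw.dtypes)
```

Printing `raw.columns.tolist()` before and after the rename is a quick way to spot invisible characters that make columns unselectable by name.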

It was also challenging to join the data sources into a single dataframe, as some columns were numeric while others were character. This required converting characters to numeric for subsequent calculations during analysis. SQL was particularly difficult at first, as I was unfamiliar with how to set up the database connection.
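The character-to-numeric conversion needed before joining can be sketched like this in Python. The two frames and their column names are hypothetical stand-ins for our real sources, where one file stored the year as text and another as integers.

```python
import pandas as pd

# Hypothetical frames: one stores year as strings, the other as integers
trends = pd.DataFrame({"year": ["2004", "2005"], "Popularity": [10, 12]})
crime = pd.DataFrame({"year": [2004, 2005], "Violent Crime": [463.2, 469.0]})

# Align key types before joining, then merge on the shared 'year' column
trends["year"] = trends["year"].astype(int)
merged = trends.merge(crime, on="year", how="inner")
print(merged)
```

Without the `astype(int)` step, the inner join would silently return zero rows, since `"2004"` and `2004` never match as keys.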

Compiling the Python, R, and SQL code into one HTML document was challenging and required quite a bit of formatting to get to a point where I was happy with it.

It was challenging to visualize trends in the data during analysis due to the multiple-order-of-magnitude differences in the scales of our data. For example, the violent crime rate has values on the scale of hundreds, while overdose deaths per year are on the order of 100,000. This was fixed by applying a scaling function so that all data range from 0 to 100, showing relative year-over-year changes.

The data analysis pipeline was successfully implemented in R, SQL, Python, and Excel. In addition, the project was compiled into one master HTML document to present all information across the various languages (Excel attached separately). Before this project I had never coded in Python past “Hello World”, had never used R, had never even heard of SQL, and had certainly never generated an HTML document. I now feel confident using all of these tools. This project also demonstrates a grasp of statistical analysis and how to analyze data. The inherent disconnection between the datasets in this project made the analysis a creative process.

I also feel comfortable using terms like extract, separate wider/longer, aggregate, and group by when talking about data wrangling in my professional and conversational life.

For particularly challenging problems, such as the initial generation of the year variable for the Taylor Swift Google Trends data, data scaling, and graphical analysis, AI (ChatGPT) was used to assist in writing functions. This was empowering, as AI can be a powerful collaborator.